May 14th 2020

Project requirements

Follow the IMRAD standard scientific structure:

  • Introduction
  • Materials and Methods
  • Results (And)
  • Discussion

Keep a technical focus, but take care to communicate whichever biological insights you arrived at

Should not include all your code (we will look into that at the individual examinations), but rather focus on the broader picture of what you did and include data summaries and visualisations

Created using ioslides_presentation rmarkdown (i.e. the right-most doc column in the project organisation will be an rmarkdown-based presentation)

Introduction

  • Intro to snake venom
  • Data set for the study:
    • Venom compositions from snakes all around the world
  • Goal of study:
    • Group snakes by genus based on venom composition (PCA, K-means, ANN)

The dataset

  • Main data
  • New data, adding three snakes from literature
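
A minimal sketch of how rows for new snakes could be appended to the main table, using base R only. The data frames, the toxin columns (`PLA2`, `SVMP`) and all values are hypothetical, and filling an unreported toxin with 0 is an assumption, not necessarily what was done in the project.

```r
# Hypothetical miniature tables; Snake and Family follow the column names
# in the readr column specification, the toxin columns and values are made up.
main_data <- data.frame(
  Snake  = c("Snake A", "Snake B"),
  Family = c("Viperidae", "Elapidae"),
  PLA2   = c(30, 55),
  SVMP   = c(45, 0),
  stringsAsFactors = FALSE
)
new_data <- data.frame(
  Snake  = "Snake C",
  Family = "Viperidae",
  PLA2   = 20,
  stringsAsFactors = FALSE
)

# Add toxin columns that the new rows lack, assuming "not reported" means 0
missing_cols <- setdiff(names(main_data), names(new_data))
new_data[missing_cols] <- 0

# Stack the rows, with the new columns put in the same order as the main table
augmented <- rbind(main_data, new_data[names(main_data)])
augmented
```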

Materials and methods

  • Talk about how the data sets are merged
  • Talk about which methods have been used to reach the goal for the study (PCA, K-means, ANN)
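
As an illustration of the PCA and K-means steps, here is a minimal base R sketch on simulated data. The toxin columns, the two simulated groups and the choice of two clusters are assumptions for the example and do not reproduce the project's analysis.

```r
set.seed(1)  # k-means uses random starting centres

# Simulated venom compositions: rows are snakes, columns are toxin
# families (percent of venom). Two artificial groups of 10 snakes each.
group_a <- cbind(PLA2 = rnorm(10, 60, 3), SVMP = rnorm(10, 10, 3), SVSP = rnorm(10, 5, 1))
group_b <- cbind(PLA2 = rnorm(10, 10, 3), SVMP = rnorm(10, 60, 3), SVSP = rnorm(10, 15, 1))
venom <- rbind(group_a, group_b)

# PCA on centred and scaled toxin abundances
pca <- prcomp(venom, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]  # coordinates on the first two principal components

# K-means clustering on the first two principal components
km <- kmeans(scores, centers = 2, nstart = 20)

# Compare the clusters with the known (simulated) group labels
known <- rep(c("A", "B"), each = 10)
table(cluster = km$cluster, group = known)
```

With real data, the cross-table would compare clusters against genus or family instead of the simulated labels.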

Results from cleaning and augmenting the data

  • Show dirty data vs. clean data
  • Show region, family, genus, species rows
  • Show grouping of the toxin families columns

Augmented data

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Snake = col_character(),
##   Genus = col_character(),
##   Species = col_character(),
##   Reference = col_character(),
##   Country = col_character(),
##   Continent = col_character(),
##   Family = col_character()
## )
## See spec(...) for full column specifications.

World map

Snake family count

Snake family

Most abundant toxins

Compare two snakes

Intra species comparison

Shiny app

Results from PCA and K-means

Prediction models based on venom composition

  • Prediction of snake family
    • Vanilla ANN with 4 hidden neurons achieved 100% test accuracy
  • Prediction of the continent the snake originated from
    • Attempted a number of architectures ranging from 1 to 4 hidden layers, with and without dropout, and tried optimizing hyperparameters.
    • Problems with overfitting; further regularization, e.g. early stopping, might be a solution
    • The question might be ill-posed: venom composition predicts snake family and not necessarily location. For example, snakes from two different families in the same country can have completely different venom compositions.
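
The early stopping idea mentioned above can be sketched in plain R as follows. The function `early_stop_epoch` and the loss values are hypothetical and only illustrate the rule "stop when the validation loss has not improved for `patience` epochs". If the models were trained with the keras R package (an assumption; the slides do not name the library), `callback_early_stopping()` provides this behaviour.

```r
# Return the epoch at which training would stop: the first epoch where
# the validation loss has failed to improve for `patience` epochs in a row.
early_stop_epoch <- function(val_loss, patience = 3) {
  best <- Inf
  waited <- 0
  for (epoch in seq_along(val_loss)) {
    if (val_loss[epoch] < best) {
      best <- val_loss[epoch]   # new best validation loss
      waited <- 0
    } else {
      waited <- waited + 1      # no improvement this epoch
      if (waited >= patience) return(epoch)
    }
  }
  length(val_loss)  # rule never triggered: all epochs are used
}

# Made-up validation losses that rise after epoch 4 (a sign of overfitting)
val_loss <- c(2.0, 1.2, 0.9, 0.8, 0.85, 0.9, 1.0, 1.2)
stop_at <- early_stop_epoch(val_loss, patience = 3)
best_epoch <- which.min(val_loss[seq_len(stop_at)])
```

In this toy run the best validation loss occurs at `best_epoch`, a few epochs before `stop_at`, which is why the weights from the best epoch are usually the ones kept.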

Training of ANN predicting snake family

[Figure: training history of the ANN, x-axis ticks from 20 to 140. One panel shows loss and val_loss (y-axis 0 to 10); the other shows acc and val_acc (y-axis 0.2 to 1).]